require( tidyverse )
## Loading required package: tidyverse
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'tibble' was built under R version 3.4.1
## Warning: package 'tidyr' was built under R version 3.4.1
## Warning: package 'purrr' was built under R version 3.4.1
## Warning: package 'dplyr' was built under R version 3.4.1
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
require( data.table )
## Loading required package: data.table
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
require( maps )
## Loading required package: maps
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
require( zipcode )
## Loading required package: zipcode
require( lme4 )
## Loading required package: lme4
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 3.4.1
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand

This dataset covers the period of 2000–2017 and includes 62560 records of fines levied against corporations relating to violations of regulations, from cases initiated by 43 federal regulatory agencies. I downloaded the data from the Good Jobs First “Violation Tracker”, using the search GUI with all the options set to <any>.

viol <- tbl_df( fread( "/Users/willpitchers/Documents/=Job_Applications_etc/Data_Incubator_2017/violation_tracker_export.csv" ))

names( viol ) <- gsub( " ", "_", names( viol ))

viol <- viol %>% mutate( Year=as.integer( Year ), Industry_code=factor( Industry_in_Record ), Civ_Crim=factor( `Civil/Criminal` ) ) %>%
                  mutate( HQ_State_of_Parent=factor( HQ_State_of_Parent ), HQ_Country_of_Parent=factor( HQ_Country_of_Parent ) ) %>%
                  mutate( Primary_Offense=factor( Primary_Offense ), Penalty_Amount=as.numeric( gsub( "[$,]", "", Penalty_Amount ) ) ) %>% 
                  mutate( Penalty_Adj=as.numeric( gsub( "[$,]", "", Penalty_Amount_Adjusted_For_Eliminating_Multiple_Counting )) ) %>% 
                  mutate( Subtraction_From_Penalty=as.numeric( "[$,]", "", Subtraction_From_Penalty ) ) %>% 
                  mutate( Agency=factor( Agency ), Secondary_Offense=factor( Secondary_Offense ), Ownership_Structure=factor( Ownership_Structure ) ) %>% 
                  mutate( Major_Industry_of_Parent=factor( Major_Industry_of_Parent ), Zip=factor( Zip ), Facility_State=factor( Facility_State ) )
## Warning in evalq(as.numeric("[$,]", "", Subtraction_From_Penalty),
## <environment>): NAs introduced by coercion
viol <- viol %>% mutate( Civ_Crim_bin=factor( ifelse( grepl( "civil and criminal", `Civil/Criminal` )=="TRUE", "both", 
                                              ifelse( grepl( "civil", `Civil/Criminal` )=="TRUE", "civil", "criminal" ))))

# str( viol )
# summary( viol$Year )
viol

The Size of Penalties

This dataset contains fields for both Penalty_Amount and Penalty_Amount_Adjusted_For_Eliminating_Multiple_Counting, but is clear that these adjustments make very little difference to the data as a whole, as these two variables are correlation at r=0.994. I have thus elected to use the adjusted penalty values for all these analyses.

The first pattern to note is that (perhaps predictably) the penalties imposed via criminal proceedings tend to be larger than those imposed via civil proceedings, and larger still when both civil and criminal proceedings have been brought.

viol %>% filter( Penalty_Adj > 0 ) %>% ggplot( aes( Penalty_Adj )) + 
          geom_density( aes( col=Civ_Crim_bin, fill=Civ_Crim_bin ), alpha=.5 ) + 
          scale_x_log10() + 
          xlab( "Penalty (log10 $'s)" ) + 
          scale_fill_discrete( name="Type of\ncase brought") + 
          scale_colour_discrete( name="Type of\ncase brought" )

The way this pattern is built up is somewhat non-intuitive, as the linear model and boxplot below make clear. The groups mean are well-separated, and easy to distinguish statistically (small p-values), but there is so much variation within groups that the predictive value of the mean differences is small (low R2 value). The vast majority of penalties arise from civil actions – 99.3% of those recorded – and the mean value of these penalties is comparatively small.

summary( lm( Penalty_Adj ~ Civ_Crim_bin, data=viol ) )
## 
## Call:
## lm(formula = Penalty_Adj ~ Civ_Crim_bin, data = viol)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -5.913e+08 -4.644e+06 -4.640e+06 -4.627e+06  2.080e+10 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           596240136   24921654   23.93   <2e-16 ***
## Civ_Crim_bincivil    -591589764   24931682  -23.73   <2e-16 ***
## Civ_Crim_bincriminal -464209918   26451988  -17.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 176200000 on 62557 degrees of freedom
## Multiple R-squared:  0.01211,    Adjusted R-squared:  0.01207 
## F-statistic: 383.3 on 2 and 62557 DF,  p-value: < 2.2e-16
viol %>% filter( Penalty_Adj > 0 ) %>% ggplot( aes( Civ_Crim_bin, Penalty_Adj )) + 
                                        geom_point( colour="blue", alpha=0.3, position="jitter" ) + 
                                        geom_boxplot( outlier.size=0, alpha=0 ) + 
                                        coord_flip() + 
                                        xlab( "Type of case brought" ) +
                                        scale_y_log10() +
                                        ylab( "Penalty (log10 $'s)" )

Where are Violations Occurring?

Taking a broad-strokes view of the geographical data, we can see that there seems not to be a visually apparent trend in the locations where penalties are levied – this data appears to track pretty well with the locations of population centres, thought here may be more subtle patterns that are not visible at the nation-wide scale.

data( zipcode )
zipcode <- zipcode %>% mutate( zip=factor( zip ), region=substr( zip, 1, 1) )

full_join( viol, zipcode, by=c( "Zip" = "zip" ) ) %>% mutate( Zip=factor( Zip ) ) %>% filter( Civ_Crim_bin=="civil" ) %>% 
    ggplot() + geom_point( aes( x=longitude, y=latitude, col=Year ), cex=.5 ) + 
        theme_bw() + scale_x_continuous(limits = c(-125,-66), breaks = NULL ) + 
        scale_y_continuous(limits = c(25,50), breaks = NULL ) + 
        labs(x=NULL, y=NULL)

However, if we look the number of penalties paid in each state/territory over the course of this 18-yr dataset, we can see that there are many more violations in some states that others, with West Virginia being responsible for 15.22% of all penalties levied.

viol %>% filter( Facility_State != "" ) %>% ggplot( aes( reorder( Facility_State, Facility_State, function(x)-length(x) ) )) + 
                                              geom_bar( aes( fill=Major_Industry_of_Parent )) + 
                                              coord_flip() +
                                              theme( axis.text.y=element_text( hjust=0, size=9 ) ) +
                                              ylab( "no. penalties" ) + 
                                              xlab( "" ) + 
                                              scale_fill_discrete( name="Industrial Sector" )

The bars are colored by the industrial sector of the parent corporation found liable for the penalty – the blue that occupies 93.43% of the West Virginia bar represents corporations classified as “mining and minerals”. It is apparent that WV is unusual in both the number of violations and the number of those violations related to mining.

What are Corporations being Fined For?

Across the 49 industrial sectors represented, it is immediately clear from this barplot the extent to which mining & mineral corporations are over-represented (25.24% of all penalties). It is also clear that the pink areas – indicating ‘workplace safety or health’ violations – comprise the majority of violations in most sectors, but particularly so in mining.

viol %>% filter( Facility_State != "" ) %>% ggplot( aes( reorder( Major_Industry_of_Parent, Major_Industry_of_Parent, function(x)-length(x) ) )) + 
                                              geom_bar( aes( fill=Primary_Offense )) + 
                                              coord_flip() +
                                              theme( axis.text.y=element_text( hjust=0, size=9 ) ) +
                                              ylab( "no. penalties" ) + 
                                              xlab( "" ) + 
                                              scale_fill_discrete( name="Violation Categories" )

Take-Home Message

From these analyses so far, my preliminary recommendation would be that the primary focus of labor-safety lobbying efforts ought to be the West Virginian mining industry. This would also be a potentially fruitful focus for legislators and regulators, as the apparent abundance of regulatory violations would potentially allow for high statistical power in any empirical tests for the effects of regulatory/enforcement policy changes.

The over-representation of workplace health and safety violations is also noteworthy, and not limited to the mining sector, nor to West Virginia. Given the generality of this pattern, a potential ethical investor should be encouraged to investigate a corporation’s history of health & safety compliance as a priority.